SOAT: A Scene- and Object-Aware Transformer for Vision-and-Language Navigation
A.1 Limitations

We propose an approach that exploits object features in addition to scene features for vision-and-language navigation (VLN). Our approach is able to use object features for better visiolinguistic alignment (see Section 5) despite the domain gap between the images used to train the object detector and the VLN data. Specifically, object features are obtained using a Faster R-CNN detector [1] trained on photos from the web (Visual Genome [2]), in which objects are typically well framed by the photographer. In contrast, the VLN datasets used in our experiments contain panoramic images from indoor house scans, which capture objects at viewing angles determined by the navigation path. The gap between these two types of data could be reduced by either fine-tuning the detector on indoor scenes or training it on them directly.